### Visualization3: Top 10 cars which involve in car cashes?
#**Methodology -** To identify the top 10 car manufacturers involved in car accidents, we will generate a bar plot using ggplot2 in R. This plot will showcase the most common vehicle makes involved in accidents. We'll use dplyr for data summarization and arrangement to identify the top 10 manufacturers based on accident frequency.
### Hypothesis 2: Is there any relation between weather conditions, daylight, and injury severity?
#**Methodology -** To test this hypothesis, we will conduct a Pearson's Chi-squared test on a contingency table. This test will analyze the association between weather conditions, daylight, and injury severity. By examining the relationship between these variables statistically, we can determine if there's a significant correlation between them.
### Hypothesis 3: Are certain types of vehicles more likely to be involved in collisions at night compared to during the day?
#**Methodology -** To investigate this hypothesis, we will perform a Pearson's Chi-squared test comparing collision counts involving different types of vehicles during the day and at night. By statistically analyzing collision data, we can determine if certain vehicle types are more prone to accidents during specific times of the day.
### Hypothesis 4: Do drivers distracted by electronic devices have a higher rate of collisions compared to drivers distracted by other factors?
#**Methodology -** To explore this hypothesis, we will conduct a Chi-squared test comparing collision counts between drivers distracted by electronic devices and those distracted by other factors. By examining collision data and analyzing distraction-related factors, we can determine if electronic device distraction significantly impacts collision rates.
### Prediction #3: Can we predict the extent of vehicle damage in collisions based on collision type, vehicle movement, and speed limit?
#**Methodology -** To make predictions about vehicle damage extent, we will evaluate model performance using confusion matrix and key predictors. By analyzing collision data and training a predictive model, we can assess its accuracy in predicting the extent of vehicle damage based on collision type, vehicle movement, and speed limit.
Research Design
Visualization1: What are the trends in the frequency of car crashes over years?
Methodology - To analyze the trend of crash frequency over the years, we will use ggplot2 in R. This allows us to create visual representations of data. We’ll transform the Crash_Date column into a date format and group the data by year. By visualizing this data, we can identify any noticeable patterns or trends in crash frequency over time.
Hypothesis 1: Can we determine the effectiveness of different traffic control measures in reducing collision rates? Which types of traffic controls are most effective in preventing collisions?
Methodology - To address this hypothesis, we will conduct a Pearson’s Chi-squared test between traffic control measures and injury severity. By statistically analyzing data on traffic control measures and their impact on collision rates, we can assess the effectiveness of various measures and identify the most impactful ones for preventing collisions.
Prediction 1: Can we predict the severity of injuries based on various factors such as weather conditions, road surface conditions, and collision type?
Methodology - To make predictions about injury severity, we will utilize a random forest classification model trained on relevant dataset. This model will analyze various factors such as weather conditions, road surface conditions, and collision type to predict injury severity. By assessing the model’s accuracy and performance, we can determine its effectiveness in predicting injury severity.
Prediction 2: Finding the location of Crashes using the K-means algorithm?
Methodology - K-means is a method used to cluster data points into groups based on their similarity. Initially, it randomly selects points on a graph known as centroids, which represent the centers of the clusters. Then, each data point is assigned to the nearest centroid, forming an initial grouping. Afterward, the centroids are recalculated as the average position of all the points within each cluster. This process of assigning and updating centroids iterates until the centroids stabilize, meaning they don’t move much between iterations. At that point, the algorithm has identified the clusters, and each data point belongs to the group associated with the nearest centroid. Overall, k-means is a straightforward yet powerful technique for partitioning data into distinct groups based on their proximity to one another.
Visualization2: How does the distribution of car accidents vary geographically?
Methodology - To understand the geographical distribution of car accidents, we will utilize the Leaflet package in R. This package enables us to create interactive maps displaying crash locations and associated fatalities using latitude and longitude data. By visualizing this data on a map, we can identify any geographic patterns or hotspots where accidents are more prevalent.